Technical Report: MapReduce-based Similarity Joins

Authors

Yasin N. Silva

Jason M. Reed

Lisa M. Tsosie

Abstract

Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation techniques of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm partitions and distributes the data until the subsets are small enough to be processed on a single node. It is general enough to be used with data that lies in any metric space, and it can be used with multiple data types, e.g., numerical data, vector data, and text. We present multiple guidelines for implementing the algorithm in Hadoop, a widely used open-source cloud system. The extensive experimental evaluation of the implemented operation shows that it has very good execution time and scalability properties.
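As a point of reference for the definition above, the following is a minimal single-node sketch of a Similarity Join as a brute-force self-join. It is not MRSimJoin and does no partitioning or distribution; it only illustrates the predicate "distance smaller than ε". The function name `similarity_join`, the sample points, and the choice of Euclidean distance are assumptions made for illustration; any metric could be passed in its place.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def similarity_join(points, eps, distance=dist):
    """Brute-force self-join: return every pair whose distance is below eps.

    Hypothetical helper for illustration only. It compares all pairs, so its
    cost grows quadratically with the number of points.
    """
    return [(p, q) for p, q in combinations(points, 2) if distance(p, q) < eps]


# Illustrative data: two tight clusters that are far from each other.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0), (3.2, 4.1)]
pairs = similarity_join(points, eps=1.0)
print(pairs)
```

With these sample points, only the two within-cluster pairs satisfy the threshold. The quadratic cost of this all-pairs comparison is the kind of workload that motivates partitioning the data across nodes.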

Similar resources

SIGMOD RWE Review "Efficient Parallel Set-Similarity Joins Using MapReduce"

This document is a review report on the paper "Efficient Parallel Set-Similarity Joins Using MapReduce" by R. Vernica, M. Carey, and C. Li, prepared by SIGMOD's 2010 Repeatability and Workability Evaluation Committee. In this section the provided resources (code, data sets, setup information) and hardware setups of the authors and reviewers are discussed. Detailed information on all experiments that the revi...

Comparing MapReduce-Based k-NN Similarity Joins on Hadoop for High-Dimensional Data

Similarity joins represent a useful operator for data mining, data analysis, and data exploration applications. With the exponential growth of data to be analyzed, distributed approaches like MapReduce are required. So far, the state-of-the-art similarity join approaches based on MapReduce have mainly focused on the processing of low-dimensional vector data. In this paper, we revisit and investigate ...

Efficient and Scalable Graph Similarity Joins in MapReduce

Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning and near-duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given thres...

MapReduce Based Personalized Locality Sensitive Hashing for Similarity Joins on Large Scale Data

Locality Sensitive Hashing (LSH) has been proposed as an efficient technique for similarity joins on high-dimensional data. The efficiency and approximation rate of LSH depend on the number of generated false positive instances and false negative instances. In many domains, reducing the number of false positives is crucial. Furthermore, in some application scenarios, balancing false positives ...

Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop

The Earth Mover's Distance (EMD) similarity join has a number of important applications such as near-duplicate image retrieval and distributed based pattern analysis. However, the computational cost of EMD is super-cubic, and consequently the EMD similarity join operation is prohibitive for datasets of even medium size. We propose to employ the Hadoop platform to speed up the operation. Simply p...

Publication date: 2012
